Abstract - Sampling variability results in uncertainties of measures. The nonparametric two-
sample  bootstrap  method  has  been  used  to  compute  uncertainties  of  measures  in  receiver 
operating  characteristic  (ROC)  analysis  on  large  datasets,  such  as  the  standard  error  (SE)  of  the 
equal  error  rate  in  biometrics,  the  SE  of  a detection  cost  function  in  speaker  recognition 
evaluation, and others. In particular, the SE of the area under an ROC curve (AURC) can be
computed analytically using the Mann-Whitney statistic. It can also be calculated using the
nonparametric two-sample bootstrap method. The analytical result can be treated as a ground
truth. The relative errors of the bootstrap results with respect to the analytical results
were examined for different matching algorithms and found to be quite small. Hence, this
validates  the  nonparametric  two-sample  bootstrap  method  applied  in  ROC  analysis  on  large 
datasets. 


Index  Terms  — ROC  analysis,  bootstrap,  area  under  ROC  curve,  uncertainty,  standard  error, 
biometrics, speaker recognition.

1 Introduction


Sampling variability results in uncertainties for all measurements. That is, when multiple
sample sets are collected under similar conditions, the statistical measures computed from
them vary from one set to another. Quantifying measurement uncertainty is therefore an
important issue: when evaluating and comparing the performance of algorithms, the measurement
uncertainties must be taken into account [1-3].

Receiver operating characteristic (ROC) analysis is an important statistical technique in many
areas. The nonparametric two-sample bootstrap method has been used to compute uncertainties
of measures in operational ROC analysis on large datasets, such as the standard error (SE) of the
equal error rate (EER) in biometrics and the SE of a detection cost function in speaker
recognition evaluation, based on our extensive bootstrap variability studies on large datasets
[2]. The detection cost function is defined as a weighted sum of the probabilities of type I error
and type II error [4]. It has been hard to calculate the uncertainties of these statistics of
interest without using the two-sample bootstrap method [3].

The  area  under  an  ROC  curve  (AURC)  is  an  important  metric  in  ROC  analysis  [5,  and  references 
therein].  The  AURC  corresponds  to  the  probability  of  correctly  identifying  which  of  the  two 
stimuli  is  more  likely  than  the  other.  It  measures  the  overall  ROC  curve  rather  than  the 
performance  at  a particular  operational  point  on  the  ROC  curve.  Moreover,  if  it  is  computed 
using  the  trapezoidal  rule,  the  AURC  is  equivalent  to  the  Mann-Whitney  statistic  that  is  formed 
by  matching  scores,  namely,  genuine  and  impostor  scores  in  biometrics,  or  target  and  non-target 
scores  in  speaker  recognition  evaluation,  etc.  Hence,  the  variance  of  the  Mann-Whitney  statistic 
can  be  utilized  as  the  variance  of  the  AURC.  In  other  words,  the  SE  of  AURC  can  be  computed 
analytically.  This  analytical  approach  is  a deterministic  process  and  thus  the  result  is  unique. 

The Mann-Whitney statistic is asymptotically normally distributed, regardless of the
distributions of matching scores, thanks to the Central Limit Theorem. Thus, the Z statistic
formulated in terms of two AURCs along with their SEs and correlation coefficient follows
the standard normal distribution and can be used to test the significance of the difference
between two ROC curves [6, 7, and references therein].
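For reference, a commonly used form of such a Z statistic for two correlated AURC estimates is sketched below. The subscripts 1 and 2, and the symbol r for the correlation coefficient between the two estimates, are notation assumed here for illustration; the exact formulation should be taken from [6, 7].

```latex
Z = \frac{\hat{A}_1 - \hat{A}_2}
         {\sqrt{SE^2(\hat{A}_1) + SE^2(\hat{A}_2)
                - 2\, r\, SE(\hat{A}_1)\, SE(\hat{A}_2)}}
```

When the two AURCs are estimated from independent samples, r = 0 and the denominator reduces to the root of the sum of the two squared SEs.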

On  the  other  hand,  the  SE  of  AURC  can  also  be  calculated  using  the  nonparametric  two-sample 
bootstrap  method  [1-3,  8-10].  Unlike  the  analytical  approach  using  the  Mann-Whitney  statistic, 
the  bootstrap  method  is  a stochastic  process.  In  other  words,  the  result  will  change  for  different 
runs.  Thus,  rather  than  dealing  with  a single  measure  of  the  SE  of  AURC  in  the  analytical 
approach,  the  results  derived  from  the  bootstrap  method  constitute  a probability  distribution. 
Some results may be more probable and others less. However, if the analytical result is treated as
the ground truth and if the relative errors of the bootstrap results with respect to the analytical
results are not large, this can validate the nonparametric two-sample bootstrap method used in
computing the uncertainties of those statistics of interest on large datasets that cannot be
calculated otherwise.

The  nonparametric  two-sample  bootstrap  method  is  particularly  of  interest  in  operational  ROC 
analysis on large datasets. The two samples are referred to as a set of genuine (i.e., target) scores
and a set of impostor (i.e., non-target) scores. They constitute two distributions. An ROC curve is
characterized  by  the  relative  relationship  between  these  two  distributions  [5,  11].  These  two 
distribution  functions  are  indeed  interrelated  by  the  algorithm  that  generates  them.  In  other 
words,  the  performance  of  a matching  algorithm  is  affected  not  only  by  genuine  matching  but 
also  by  impostor  matching.  All  statistics  of  interest  in  ROC  analysis  are  influenced  by  the 
combined  impact  of  these  two  samples. 

Furthermore, it was shown in our previous studies that 1) these two distributions usually do not
have well-defined parametric forms; 2) the shapes of these two distributions may be considerably
different for the same algorithm; and 3) the distributions may vary substantially from algorithm
to algorithm in a way that differentiates algorithms in terms of matching accuracy [5]. This
suggests that nonparametric statistical analysis is appropriate for analyzing such data. Thus,
the empirical distribution is assumed for each of the two sets of observed scores.

As is well known, the bootstrap method assumes that an independent and identically distributed
(i.i.d.) random sample of size n is drawn from a population with its own probability distribution.
Our large government databases used for developing similarity scores in fingerprint technology
were randomly collected from real practice rather than through multiple acquisitions and thus had
no dependencies. Thus, the random sample is assumed to be i.i.d. in our work.

With the i.i.d. assumption, the objects of a nonparametric two-sample bootstrap are individuals
in the sample [2, 3]. Otherwise, the bootstrap objects are the subsets of the sample into which the
sample is grouped based on data dependencies caused by multiple biometric acquisitions [12,
13]. This can preserve the dependencies among the data. However, everything else in the
bootstrap method remains intact. Of course, how the sample is grouped into subsets will have
an impact on the bootstrap results. From the statistical point of view, the sample
should be collected as randomly as possible in test design.

The number of bootstrap replications is a very important parameter in the bootstrap method. In
order to reduce the bootstrap variance and ensure the accuracy of the computation in our
applications, where the size of data samples is large, the statistics of interest are probabilities,
and no normality assumption can be made for distributions of similarity scores, the bootstrap
variability was empirically studied extensively [2, 14]. As a result of our study, the appropriate
number of bootstrap replications was determined to be 2000 in our applications.

In  this  article,  the  total  number  of  genuine  scores  is  a little  over  60  000  and  the  total  number  of 
impostor  scores  is  a little  over  120  000.  As  demonstrated  in  our  previous  studies  of  sample  size 
in  fingerprint  applications,  if  the  numbers  of  similarity  scores  get  larger  than  these,  the 
measurement  accuracy  will  improve  little  [15].  The  research  was  carried  out  by  applying 
Chebyshev’s  inequality  to  two  metrics:  the  AURC  and  the  true  accept  rate  (TAR)  at  an 
operational  false  accept  rate  (FAR).  All  similarity  scores  were  converted  to  integers  if  they  were 
not  already.  Hence,  the  probability  distribution  functions  of  the  similarity  scores  were  all 
discrete,  and  thus  the  ROC  curve  was  not  a smooth  curve  [5]. 

The analytical method using the Mann-Whitney statistic to compute the estimated $SE_A(\hat{A})$
of AURC, along with the formulations of the discrete distribution functions of genuine scores
and impostor scores, is shown in Section 2. The algorithm of the nonparametric two-sample
bootstrap method for calculating the estimated $SE_B(\hat{A})$ of AURC and the procedure for
generating a probability distribution of $SE_B(\hat{A})$ are provided in Section 3. The relative
errors used for comparison are presented in Section 4. The results of the analytical method and
of the bootstrap method, as well as a comparison of the two, are offered in Section 5, using
14 different fingerprint-image matching algorithms (footnote 1) as examples. Finally, conclusions
and discussion can be found in Section 6.

2 The analytical method to compute the estimated $SE_A(\hat{A})$ of AURC

It  is  assumed  that  the  trapezoidal  rule  is  employed  while  computing  AURC,  and  thus  the  AURC 
is  equivalent  to  the  Mann-Whitney  statistic  directly  formed  from  the  discrete  genuine  and 
impostor  scores.  Further,  the  variance  of  the  Mann-Whitney  statistic  can  be  computed 
analytically.  Hence,  it  can  be  utilized  as  the  variance  of  AURC  [5,  and  references  therein].  First 
are  the  formulations  of  distribution  functions. 

2.1  The  formulations  of  discrete  distribution  functions  of  genuine  and  impostor  scores 

All similarity scores were converted into integers if they were not already, as mentioned in
Section 1. Thus, without loss of generality, the similarity scores generated by an algorithm are
expressed inclusively using the integer score set
$\{s\} = \{s_{\min}, s_{\min}+1, \ldots, s_{\max}\}$, running consecutively from the lowest score
$s_{\min}$ to the highest score $s_{\max}$.

The genuine score set is denoted as

$$ G = \{\, m_i \mid m_i \in \{s\} \text{ and } i = 1, \ldots, N_G \,\} , \qquad (1) $$

where $N_G$ is the total number of genuine scores. The impostor score set is expressed as

$$ I = \{\, n_i \mid n_i \in \{s\} \text{ and } i = 1, \ldots, N_I \,\} , \qquad (2) $$

where $N_I$ is the total number of impostor scores.

These two sets of similarity scores constitute two discrete probability distribution functions,
respectively. Let $P_i(s)$, where $s \in \{s\}$ and $i \in \{G, I\}$, denote the empirical
probabilities of the genuine scores and the impostor scores at a score $s$, respectively. Some of
these probabilities may well be zero at some scores in the set $\{s\}$. Nonetheless, the two
distribution functions can be expressed, respectively, as

$$ P_i = \Big\{\, P_i(s) \;\Big|\; \forall\, s \in \{s\} \text{ and } \sum_{\tau = s_{\min}}^{s_{\max}} P_i(\tau) = 1 \,\Big\} , \quad i \in \{G, I\} . \qquad (3) $$


Footnote 1: Specific hardware and software products identified in this report were used in order
to adequately support the development of technology to conduct the performance evaluations
described in this document. In no case does such identification imply recommendation or
endorsement by the National Institute of Standards and Technology, nor does it imply that the
products and equipment identified are necessarily the best available for the purpose.

The cumulative discrete probability distribution functions of genuine scores and impostor scores
are defined in this article to be the probabilities cumulated from the highest score $s_{\max}$
down to the integer score $s$, and are expressed as

$$ C_i = \Big\{\, C_i(s) = \sum_{\tau = s}^{s_{\max}} P_i(\tau) \;\Big|\; \forall\, s \in \{s\} \,\Big\} , \quad i \in \{G, I\} , \qquad (4) $$

where $C_i(s)$, $i \in \{G, I\}$, are the cumulative probabilities of genuine scores and impostor
scores at a score $s$, respectively.
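The definitions in Eqs. (3) and (4) can be illustrated with a minimal Python sketch. The function names and the toy score lists below are assumptions made for illustration only; they are not taken from the study.

```python
from collections import Counter


def empirical_pmf(scores, s_min, s_max):
    """Empirical probability P(s) for every integer s in [s_min, s_max] (Eq. 3).

    Scores that never occur get probability zero (zero-frequency scores).
    """
    counts = Counter(scores)
    n = len(scores)
    return {s: counts.get(s, 0) / n for s in range(s_min, s_max + 1)}


def cumulative_from_top(pmf, s_min, s_max):
    """C(s) = sum of P(t) for t = s, ..., s_max, cumulated downward (Eq. 4)."""
    cum, running = {}, 0.0
    for s in range(s_max, s_min - 1, -1):
        running += pmf[s]
        cum[s] = running
    return cum


# Hypothetical toy data, not from the paper.
genuine = [3, 4, 4, 5]
impostor = [1, 1, 2, 3, 4]
s_min, s_max = 1, 5

p_g = empirical_pmf(genuine, s_min, s_max)
p_i = empirical_pmf(impostor, s_min, s_max)
c_g = cumulative_from_top(p_g, s_min, s_max)   # true accept rate at threshold s
c_i = cumulative_from_top(p_i, s_min, s_max)   # false accept rate at threshold s
```

In this sketch, `c_g[s]` and `c_i[s]` play the roles of the TAR and FAR coordinates of the ROC curve at score s.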

2.2  Compute  the  estimated  AURC 


Figure  1 A schematic  drawing  of  four  points  A,  B,  C,  and  D along  with  their  coordinates  in  the  FAR-and- 
TAR  coordinate  system.  They  form  a trapezoid  at  a score  s,  and  BC  is  a segment  of  an  ROC  curve. 

After conversion of similarity scores to integers, the distributions of genuine scores and impostor
scores are all discrete. As a result, the ROC curve is no longer a smooth curve. While cumulating
the probabilities of genuine scores and impostor scores from the highest similarity score, an ROC
curve can move horizontally, move vertically, incline toward the upper right, or stay where it is
for each decrement of the score, depending on whether $P_I(s)$ and/or $P_G(s)$ are nonzero.
Thus, the AURC consists of a set of trapezoids, each of which is built from a rectangle and a
triangle in general. The trapezoid can be reduced to a rectangle, a vertical line, or a point.

Without loss of generality, a trapezoid is shown in Figure 1. In the FAR-and-TAR coordinate
system, at a score $s \in \{s\}$, by including zero-frequency scores, a trapezoid is constructed by
four points: A $(C_I(s+1), 0)$, B $(C_I(s+1), C_G(s+1))$, C $(C_I(s), C_G(s))$, and
D $(C_I(s), 0)$, in clockwise direction, assuming $C_I(s_{\max}+1) = C_G(s_{\max}+1) = 0$. This
boundary condition corresponds to the origin of the FAR-and-TAR coordinate system, and will be
applied throughout the following discussion. The lengths $C_I(s) - C_I(s+1)$ (i.e., $P_I(s)$) and
$C_G(s) - C_G(s+1)$ (i.e., $P_G(s)$) form a triangle, and the lengths $C_I(s) - C_I(s+1)$ (i.e.,
$P_I(s)$) and $C_G(s+1)$ (i.e., $\sum_{\tau=s+1}^{s_{\max}} P_G(\tau)$) create a rectangle. As a
consequence, the estimated AURC can be calculated as

$$
\begin{aligned}
\hat{A} &= \sum_{s=s_{\max}}^{s_{\min}} \mathrm{trapezoid}(s) \\
        &= \sum_{s=s_{\max}}^{s_{\min}} \mathrm{triangle}(s) + \sum_{s=s_{\max}}^{s_{\min}} \mathrm{rectangle}(s) \\
        &= \sum_{s=s_{\max}}^{s_{\min}} P_I(s) \times \Big[ \tfrac{1}{2} \times P_G(s) + \sum_{\tau=s+1}^{s_{\max}} P_G(\tau) \Big]
\end{aligned}
\qquad (5)
$$

Note that the summation runs consecutively in descending order from $s_{\max}$ to $s_{\min}$,
including zero-frequency scores, and $\sum_{\tau=s_{\max}+1}^{s_{\max}} = 0$ is assumed according
to the above boundary condition. This notation will be applied throughout the following discussion.
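The trapezoid sum of Eq. (5) can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the scores are already integers; the function name `aurc_trapezoid` and the toy data are hypothetical.

```python
from collections import Counter


def aurc_trapezoid(genuine, impostor):
    """Estimated AURC as the sum over scores of triangle + rectangle areas (Eq. 5)."""
    n_g, n_i = len(genuine), len(impostor)
    cnt_g, cnt_i = Counter(genuine), Counter(impostor)
    s_min = min(min(genuine), min(impostor))
    s_max = max(max(genuine), max(impostor))

    area = 0.0
    c_g_above = 0.0  # C_G(s + 1); zero at s = s_max by the boundary condition
    # Run in descending order from s_max to s_min, including zero-frequency scores.
    for s in range(s_max, s_min - 1, -1):
        p_g = cnt_g.get(s, 0) / n_g
        p_i = cnt_i.get(s, 0) / n_i
        # triangle: 0.5 * P_I(s) * P_G(s); rectangle: P_I(s) * C_G(s + 1)
        area += p_i * (0.5 * p_g + c_g_above)
        c_g_above += p_g
    return area


# Hypothetical toy data, not from the paper.
genuine = [3, 4, 4, 5]
impostor = [1, 1, 2, 3, 4]
a_hat = aurc_trapezoid(genuine, impostor)
```

A single pass over the score range suffices because the running sum `c_g_above` carries the cumulative genuine probability from one score to the next.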

2.3  Relate  AURC  to  the  Mann-Whitney  statistic 

In order to relate AURC to the Mann-Whitney statistic, the order relations among similarity
scores are established as follows. All the $N_I$ scores in the impostor score set I in Eq. (2) are
compared with all the $N_G$ scores in the genuine score set G in Eq. (1). Each comparison counts
1, 1/2, or 0 depending on whether an impostor score $s_I$ is less than, equal to, or greater than
a genuine score $s_G$. This rule can be expressed as

$$
R(s_G, s_I) =
\begin{cases}
1 & \text{if } s_I < s_G \\
\tfrac{1}{2} & \text{if } s_I = s_G \\
0 & \text{if } s_I > s_G
\end{cases}
\qquad (6)
$$

After converting the probabilities of genuine and impostor scores in Eq. (5) back to frequencies
and by including zero-frequency scores, the first term in Eq. (5) is the total number of score
pairs in which the impostor score is equal to the genuine score, weighted by 1/2 and divided by
$N_G N_I$. The second term in Eq. (5) is the total number of score pairs in which the impostor
score is less than the genuine score, weighted by 1 and divided by $N_G N_I$. This term is the
so-called “number of inversions” in a sequence formed by impostor and genuine scores [16].


Finally, the estimated AURC can be rewritten as

$$ \hat{A} = \frac{1}{N_G N_I} \times \sum_{s_G=1}^{N_G} \sum_{s_I=1}^{N_I} R(s_G, s_I) \qquad (7) $$


Except  for  the  coefficient,  this  is  exactly  the  Mann-Whitney  statistic  formed  by  the  genuine  and 
impostor  scores.  As  a consequence,  the  variance  of  AURC  can  be  obtained  by  computing  the 
variance  of  the  Mann-Whitney  statistic. 
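The pairwise form of Eqs. (6) and (7) can be checked by brute force on a small example. The sketch below is illustrative only; the names `r_kernel` and `aurc_mann_whitney` and the toy data are assumptions, and the quadratic-time loop is not suitable for the dataset sizes used in this article.

```python
def r_kernel(s_g, s_i):
    """Scoring rule of Eq. (6) for one (genuine, impostor) pair."""
    if s_i < s_g:
        return 1.0
    if s_i == s_g:
        return 0.5
    return 0.0


def aurc_mann_whitney(genuine, impostor):
    """Estimated AURC as the normalized Mann-Whitney sum of Eq. (7)."""
    total = sum(r_kernel(g, i) for g in genuine for i in impostor)
    return total / (len(genuine) * len(impostor))


# Hypothetical toy data, not from the paper.
genuine = [3, 4, 4, 5]
impostor = [1, 1, 2, 3, 4]
a_pairs = aurc_mann_whitney(genuine, impostor)
```

For this toy data the 20 pairs contribute 16 strict wins and 3 ties, so the sum is 16 + 3 × 0.5 = 17.5 and the estimate is 17.5 / 20 = 0.875, the same value obtained by evaluating the trapezoid sum of Eq. (5) by hand on these scores.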


2.4 Compute the estimated $SE_A(\hat{A})$ of AURC


The variance of the Mann-Whitney statistic can be computed analytically, and it is utilized as the
variance of AURC. To do so, two more cumulative probability distribution functions are
required. One is

$$ Q_G = \Big\{\, Q_G(s) = \sum_{\tau=s+1}^{s_{\max}} P_G(\tau) \;\Big|\; \forall\, s \in \{s\} \,\Big\} . \qquad (8) $$

The other one is

$$ Q_I = \Big\{\, Q_I(s) = \sum_{\tau=s_{\min}}^{s-1} P_I(\tau) \;\Big|\; \forall\, s \in \{s\} \,\Big\} , \qquad (9) $$

where another boundary condition $\sum_{\tau=s_{\min}}^{s_{\min}-1} = 0$ is assumed. Note that the
probabilities are cumulated from $s_{\max}$ down to $s + 1$ for genuine scores in Eq. (8), and
from $s_{\min}$ up to $s - 1$ for impostor scores in Eq. (9).

The probability $B_{GGI}$, that two randomly chosen genuine matches will obtain higher similarity
scores than one randomly chosen impostor match, can be written as

$$ B_{GGI} = \sum_{s=s_{\min}}^{s_{\max}} P_I(s) \times \Big[ Q_G^2(s) + Q_G(s) \times P_G(s) + \tfrac{1}{3} \times P_G^2(s) \Big] \qquad (10) $$

And the probability $B_{IIG}$, that one randomly chosen genuine match will get a higher similarity
score than two randomly chosen impostor matches, can be expressed as

$$ B_{IIG} = \sum_{s=s_{\min}}^{s_{\max}} P_G(s) \times \Big[ Q_I^2(s) + Q_I(s) \times P_I(s) + \tfrac{1}{3} \times P_I^2(s) \Big] \qquad (11) $$

Finally, the analytical estimator of the SE of AURC can be computed as

$$ SE_A(\hat{A}) = \sqrt{ \frac{1}{N_G N_I} \times \Big[ \hat{A}\,(1 - \hat{A}) + (N_G - 1)\,(B_{GGI} - \hat{A}^2) + (N_I - 1)\,(B_{IIG} - \hat{A}^2) \Big] } \qquad (12) $$
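The analytical computation of this section can be sketched end to end in Python. This is a minimal illustration, assuming integer scores and a coefficient of 1/3 on the squared tie terms of the two three-way probabilities; the function name `analytical_se` and the toy data are hypothetical.

```python
import math
from collections import Counter


def analytical_se(genuine, impostor):
    """Return (A_hat, SE_A) from the Mann-Whitney variance formula."""
    n_g, n_i = len(genuine), len(impostor)
    cnt_g, cnt_i = Counter(genuine), Counter(impostor)
    s_min = min(min(genuine), min(impostor))
    s_max = max(max(genuine), max(impostor))
    scores = range(s_min, s_max + 1)
    p_g = {s: cnt_g.get(s, 0) / n_g for s in scores}
    p_i = {s: cnt_i.get(s, 0) / n_i for s in scores}

    # Q_G(s): genuine probability strictly above s (cumulated downward).
    q_g, running = {}, 0.0
    for s in reversed(scores):
        q_g[s] = running
        running += p_g[s]

    # Q_I(s): impostor probability strictly below s (cumulated upward).
    q_i, running = {}, 0.0
    for s in scores:
        q_i[s] = running
        running += p_i[s]

    a = sum(p_i[s] * (0.5 * p_g[s] + q_g[s]) for s in scores)
    b_ggi = sum(p_i[s] * (q_g[s] ** 2 + q_g[s] * p_g[s] + p_g[s] ** 2 / 3)
                for s in scores)
    b_iig = sum(p_g[s] * (q_i[s] ** 2 + q_i[s] * p_i[s] + p_i[s] ** 2 / 3)
                for s in scores)
    var = (a * (1 - a)
           + (n_g - 1) * (b_ggi - a * a)
           + (n_i - 1) * (b_iig - a * a)) / (n_g * n_i)
    return a, math.sqrt(var)


# Hypothetical toy data, not from the paper.
a_hat, se_a = analytical_se([3, 4, 4, 5], [1, 1, 2, 3, 4])
```

Because the computation works on the score histograms rather than on individual pairs, its cost grows with the number of distinct scores and not with the product of the two sample sizes.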

3 The bootstrap method to compute the estimated $SE_B(\hat{A})$ of AURC

3.1  The  algorithm  of  the  nonparametric  two-sample  bootstrap  [1-3,  8-10] 

The estimated uncertainties in terms of SE and 95 % confidence interval (CI) can also be
computed  using  the  nonparametric  two-sample  bootstrap.  Assuming  the  data  set  is  i.i.d.,  the 
bootstrap  objects  are  individuals  in  the  data  set,  rather  than  subsets  of  the  sample  into  which  the 
sample  data  are  grouped  according  to  data  dependencies,  as  mentioned  in  Section  1.  With  such 
an  assumption,  the  algorithm  of  the  nonparametric  two-sample  bootstrap  is  as  follows. 

Algorithm I (Nonparametric two-sample bootstrap)

1: for i = 1 to B do
2:    select $N_G$ scores randomly WR from G to form a set {new $N_G$ genuine scores}$_i$
3:    select $N_I$ scores randomly WR from I to form a set {new $N_I$ impostor scores}$_i$
4:    {new $N_G$ genuine scores}$_i$ & {new $N_I$ impostor scores}$_i$ => statistic $\hat{A}_i$
5: end for
6: $\{\hat{A}_i \mid i = 1, \ldots, B\}$ => $SE_B$ and $(Q_B(\alpha/2), Q_B(1 - \alpha/2))$
7: end

where B is the number of two-sample bootstrap replications and WR stands for “with
replacement”. The original genuine score set G with $N_G$ scores shown in Eq. (1) and the original
impostor score set I with $N_I$ scores shown in Eq. (2) are generated by a matching algorithm. As
shown in Steps 1 to 5, this algorithm runs B times. In the i-th iteration, $N_G$ scores are
randomly selected WR from the original genuine score set G to form a new set of $N_G$ genuine
scores, $N_I$ scores are randomly selected WR from the original impostor score set I to form a new
set of $N_I$ impostor scores, and then in Step 4 the i-th bootstrap replication of the estimated
AURC, i.e., $\hat{A}_i = \mathrm{AURC}_i$, is generated from these two new sets of similarity scores
using Eq. (5).

Finally, after B iterations, as indicated in Step 6, from the set
$\{\hat{A}_i \mid i = 1, \ldots, B\}$, the estimator of the SE, denoted by $SE_B$, i.e., the sample
standard deviation of the B replications, and the estimators of the $\alpha/2 \times 100$ % and
$(1 - \alpha/2) \times 100$ % quantiles of the bootstrap distribution, denoted by $Q_B(\alpha/2)$
and $Q_B(1 - \alpha/2)$, at the significance level $\alpha$ can be calculated [10]. Definition 2
of quantile in Ref. [17] is adopted. That is, the sample quantile is obtained by inverting the
empirical distribution function with averaging at discontinuities. Thus,
$(Q_B(\alpha/2), Q_B(1 - \alpha/2))$ stands for the estimated bootstrap $(1 - \alpha) \times 100$ %
CI. If the 95 % CI is of interest, then $\alpha$ is set to 0.05.
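Algorithm I and the quantile rule described above can be sketched in standard-library Python as follows. This is an illustrative sketch, not the implementation used in the study: the function names, the seed handling, the toy data, and the small number of replications in the usage lines are all assumptions.

```python
import math
import random
import statistics
from collections import Counter


def aurc(genuine, impostor):
    """Estimated AURC via the trapezoid sum of Eq. (5)."""
    cnt_g, cnt_i = Counter(genuine), Counter(impostor)
    n_g, n_i = len(genuine), len(impostor)
    s_min = min(min(genuine), min(impostor))
    s_max = max(max(genuine), max(impostor))
    area, c_g_above = 0.0, 0.0
    for s in range(s_max, s_min - 1, -1):
        p_g = cnt_g.get(s, 0) / n_g
        area += (cnt_i.get(s, 0) / n_i) * (0.5 * p_g + c_g_above)
        c_g_above += p_g
    return area


def quantile_def2(values, p):
    """Sample quantile: inverse empirical CDF with averaging at discontinuities."""
    x = sorted(values)
    n = len(x)
    h = n * p
    k = round(h)
    if math.isclose(h, k):  # n * p is an integer: average the two neighbours
        lo = x[min(max(k - 1, 0), n - 1)]
        hi = x[min(k, n - 1)]
        return 0.5 * (lo + hi)
    return x[math.floor(h)]  # otherwise the (floor(n * p) + 1)-th order statistic


def two_sample_bootstrap(genuine, impostor, b=2000, alpha=0.05, seed=0):
    """Return (SE_B, (lower, upper)) following Steps 1 to 6 of Algorithm I."""
    rng = random.Random(seed)
    reps = []
    for _ in range(b):
        g_star = rng.choices(genuine, k=len(genuine))    # resample WR from G
        i_star = rng.choices(impostor, k=len(impostor))  # resample WR from I
        reps.append(aurc(g_star, i_star))
    se_b = statistics.stdev(reps)  # sample standard deviation of the replications
    ci = (quantile_def2(reps, alpha / 2), quantile_def2(reps, 1 - alpha / 2))
    return se_b, ci


# Hypothetical toy data and a small B, chosen only to keep the example fast.
genuine = [3, 4, 4, 5, 5, 6]
impostor = [1, 1, 2, 2, 3, 3, 4]
se_b, (ci_low, ci_high) = two_sample_bootstrap(genuine, impostor, b=200, seed=1)
```

Fixing the seed makes a single run reproducible; different seeds give different values of the SE estimate, which is the run-to-run variability examined in Section 3.2.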

The remaining issue is to determine how many iterations this bootstrap algorithm needs to run,
i.e., what the number of nonparametric two-sample bootstrap replications should be, in order to
reduce  the  bootstrap  variance  and  ensure  the  accuracy  of  the  computation.  As  stated  in  Section  1, 
based  on  our  empirical  bootstrap  variability  studies,  the  appropriate  number  of  bootstrap 
replications  B in  our  applications  was  determined  to  be  2000  [2,  14]. 

3.2 Generate a probability distribution of $SE_B(\hat{A})$

The analytical approach of computing the estimated $SE_A(\hat{A})$ of AURC using the Mann-Whitney
statistic is a deterministic process, and thus the analytical solution is unique. On the other
hand, as pointed out in Section 1, the bootstrap method is a stochastic process. It can generate
different results from different runs, and some results may be more probable and others less.
Hence, the bootstrap estimators $SE_B(\hat{A})$ of AURC constitute a probability distribution.
The following algorithm generates such a distribution.

Algorithm II (Generating a probability distribution)

1: for i = 1 to L do
2:    for j = 1 to B do
3:       select $N_G$ scores randomly WR from G to form a set {new $N_G$ genuine scores}$_{ij}$
4:       select $N_I$ scores randomly WR from I to form a set {new $N_I$ impostor scores}$_{ij}$
5:       {new $N_G$ genuine scores}$_{ij}$ & {new $N_I$ impostor scores}$_{ij}$ => statistic $\hat{A}_{ij}$
6:    end for
7:    $\{\hat{A}_{ij} \mid j = 1, \ldots, B\}$ => $SE_{B,i}(\hat{A})$
8: end for
9: $SE_B(\hat{A}) = \{SE_{B,i}(\hat{A}) \mid i = 1, \ldots, L\}$ => estimators of mean, median, 68.27 % CI & 95 % CI
10: end

where L is the number of Monte Carlo iterations and B is the number of bootstrap replications.
In fact, Steps 2 to 7 of Algorithm II are Algorithm I shown in Section 3.1, which computes the
i-th $SE_{B,i}(\hat{A})$ of AURC using the nonparametric two-sample bootstrap and is run for L
iterations as indicated in Step 1. As shown in Step 9, L estimates $SE_{B,i}(\hat{A})$ of AURC are
created and form a set $SE_B(\hat{A})$. Subsequently, from this set, the estimated mean, median,
68.27 % CI, and 95 % CI of the distribution of estimated $SE_B(\hat{A})$ of AURC can be calculated.
The CI can be obtained using Definition 2 of quantile in Ref. [17].
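The nested structure of Algorithm II can be sketched as follows. This is a minimal illustration in standard-library Python; the function names, the toy data, and the small values of L and B in the usage lines are assumptions chosen to keep the example fast, not the settings used in the study.

```python
import math
import random
import statistics
from collections import Counter


def aurc(genuine, impostor):
    """Estimated AURC via the trapezoid sum of Eq. (5)."""
    cnt_g, cnt_i = Counter(genuine), Counter(impostor)
    n_g, n_i = len(genuine), len(impostor)
    s_min = min(min(genuine), min(impostor))
    s_max = max(max(genuine), max(impostor))
    area, c_g_above = 0.0, 0.0
    for s in range(s_max, s_min - 1, -1):
        p_g = cnt_g.get(s, 0) / n_g
        area += (cnt_i.get(s, 0) / n_i) * (0.5 * p_g + c_g_above)
        c_g_above += p_g
    return area


def quantile_def2(values, p):
    """Inverse empirical CDF with averaging at discontinuities."""
    x = sorted(values)
    n = len(x)
    h = n * p
    k = round(h)
    if math.isclose(h, k):
        return 0.5 * (x[min(max(k - 1, 0), n - 1)] + x[min(k, n - 1)])
    return x[math.floor(h)]


def bootstrap_se(genuine, impostor, b, rng):
    """Inner loop (Steps 2 to 7): one bootstrap estimate of the SE of AURC."""
    reps = [aurc(rng.choices(genuine, k=len(genuine)),
                 rng.choices(impostor, k=len(impostor)))
            for _ in range(b)]
    return statistics.stdev(reps)


def se_distribution(genuine, impostor, l_runs, b, seed=0):
    """Outer loop (Steps 1 and 8): collect l_runs independent SE estimates."""
    rng = random.Random(seed)
    return [bootstrap_se(genuine, impostor, b, rng) for _ in range(l_runs)]


# Hypothetical toy data with small L and B.
genuine = [3, 4, 4, 5, 5, 6]
impostor = [1, 1, 2, 2, 3, 3, 4]
se_values = se_distribution(genuine, impostor, l_runs=20, b=100, seed=42)

# Step 9: summary statistics of the distribution of SE estimates.
summary = {
    "mean": statistics.mean(se_values),
    "median": statistics.median(se_values),
    "ci95": (quantile_def2(se_values, 0.025), quantile_def2(se_values, 0.975)),
}
```

A single random generator is shared by all runs so that the L estimates are drawn from one reproducible stream rather than from L identically seeded ones.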


$SE_B(\hat{A})$                       Number of iterations L
                      100         200         300         400         500
Alg. A    Min.     0.0001289   0.0001288   0.0001278   0.0001262   0.0001279
          Max.     0.0001393   0.0001413   0.0001395   0.0001385   0.0001418
          Range    0.0000104   0.0000125   0.0000118   0.0000123   0.0000140
Alg. B    Min.     0.0004595   0.0004560   0.0004560   0.0004574   0.0004560
          Max.     0.0004978   0.0004978   0.0005011   0.0004984   0.0004963
          Range    0.0000383   0.0000418   0.0000451   0.0000410   0.0000403

Table 1 High-accuracy Algorithm A’s and low-accuracy Algorithm B’s minimum, maximum, and range of
the L estimated $SE_B(\hat{A})$ of AURC, where the number of iterations L was set from 100 up to
500 at intervals of 100, while the number of bootstrap replications B was set to 2000.


In Algorithm II, the number of bootstrap replications B was set to 2000, as discussed above.
The next question is how to determine the number of iterations L. Two fingerprint-image
matching algorithms, high-accuracy A and low-accuracy B, were taken as examples. The
number of iterations L was set from 100 up to 500 at intervals of 100. Then the minimum,
maximum, and range of the L estimated $SE_B(\hat{A})$ of AURC were calculated and are shown in
Table 1. If the values are rounded to the fifth decimal place, the minimum, maximum, and range of
$SE_B(\hat{A})$ of AURC for high-accuracy Algorithm A across all five numbers of iterations are
0.00013, 0.00014, and 0.00001, respectively; and those for low-accuracy Algorithm B are 0.00046,
0.00050, and 0.00004 (except one), respectively. As a result, the discrepancies among the results
from 100 runs to 500 runs are small.

Further, in order to obtain a statistically meaningful estimated CI, the number of
$SE_B(\hat{A})$ of AURC, i.e., the number of iterations L, must be quite large. For instance, for
a 95 % CI, only about two instances are located in each tail of the distribution for 100
$SE_B(\hat{A})$, whereas there are about 12 instances in each tail for 500 $SE_B(\hat{A})$. As a
consequence, the number of iterations was set to 500. In other words, for each matching algorithm,
500 estimated $SE_B(\hat{A})$ are generated to constitute a probability distribution, and each of
the 500 estimators is computed using the nonparametric two-sample bootstrap.
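The tail counts quoted above follow from simple arithmetic: for a 95 % CI, each tail of the distribution holds a fraction α/2 = 0.025 of the L values.

```latex
\frac{\alpha}{2} \times L = 0.025 \times 100 = 2.5 ,
\qquad
\frac{\alpha}{2} \times L = 0.025 \times 500 = 12.5 .
```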

The distribution of estimated $SE_B(\hat{A})$ of AURC for the matching Algorithm A is shown in
Figure 2, where the red triangle stands for the analytical result, the blue diamonds are the two
bounds of the 68.27 % CI, and the green circles represent the two bounds of the 95 % CI. Figure 2
also shows that for Algorithm A the analytical estimator $SE_A(\hat{A})$ of AURC is very close
to the mean as well as the median of the distribution of 500 estimated $SE_B(\hat{A})$, which are
all approximately equal to 0.0001336 (see Table 2 in Section 5.1, where Algorithm A is referred to
as Algorithm 3).


Figure 2 The distribution of 500 estimators of $SE_B(\hat{A})$ of AURC computed using the
nonparametric two-sample bootstrap for the matching Algorithm A. The red triangle stands for the
analytical result, the blue diamonds are the two bounds of the 68.27 % CI, and the green circles
represent the two bounds of the 95 % CI.

4 The  relative  errors  used  for  comparison 

While comparing the bootstrap results with the unique analytical estimator $SE_A(\hat{A})$ of
AURC, the comparison is quantified by the relative error in order to take into account the impact
of the magnitude of the analytical result. The estimated relative error $\hat{\eta}$ is defined as

$$ \hat{\eta} = \frac{\left| X - SE_A(\hat{A}) \right|}{SE_A(\hat{A})} \times 100\ \% \qquad (13) $$

where $SE_A(\hat{A})$ is the analytical estimator of the SE of AURC computed using Eq. (12) in
Section 2.4, and X is one of the estimated quantities that describe the probability distribution
of the bootstrap-estimated $SE_B(\hat{A})$ of AURC.

As pointed out in Section 1, the bootstrap method is a stochastic process. When performing
comparisons involving a distribution, it is not enough to just pick one bootstrap result from a
random run. In order to take account of the variance of the stochastic process, not only should
the estimated mean and median of the distribution be compared with the analytical result, but the
upper bound and lower bound of the 68.27 % CI (corresponding to one standard deviation) as well
as the two bounds of the 95 % CI (corresponding to 1.96 standard deviations) of the distribution
should also be compared. While comparing the estimated CI with the analytical result
$SE_A(\hat{A})$ of AURC, the larger of the two relative errors obtained using the upper bound and
the lower bound of the CI, respectively, will be employed. Notice that with probability about 27 % the
bootstrap estimators of SE can fall between the 68.27 % CI and the 95 % CI of the estimated SEs.
The relative errors of the estimated $SE_B(\hat{A})$ of AURC with respect to the analytical
estimate $SE_A(\hat{A})$ using the estimated mean, median, 68.27 % CI, and 95 % CI of the
distribution of $SE_B(\hat{A})$ are denoted by $\hat{\eta}$ with corresponding subscripts,
respectively, in the following text.
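The relative-error rule of Eq. (13), including the convention of taking the larger of the two bound errors for a CI, can be sketched as below. The function names are hypothetical, and the input numbers are the rounded seven-decimal values listed for Algorithm 3 in Table 2, so the outputs are only illustrative of the calculation.

```python
def relative_error_percent(x, se_a):
    """Relative error of a bootstrap quantity x with respect to SE_A, in percent."""
    return abs(x - se_a) / se_a * 100.0


def ci_relative_error_percent(ci, se_a):
    """For a CI, use the larger of the errors at the lower and upper bounds."""
    return max(relative_error_percent(bound, se_a) for bound in ci)


# Rounded values for Algorithm 3 as listed in Table 2.
se_a = 0.0001336
mean_se_b = 0.0001336
ci95 = (0.0001297, 0.0001377)

eta_mean = relative_error_percent(mean_se_b, se_a)
eta_ci95 = ci_relative_error_percent(ci95, se_a)
```

With these rounded inputs the mean-based error is zero and the 95 % CI error is driven by the upper bound, which lies slightly farther from the analytical value than the lower bound does.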


5 Results 

5.1  The  analytical  results  and  bootstrap  results 


                                      Distribution of estimated $SE_B(\hat{A})$
Alg.  AURC       $SE_A(\hat{A})$  Mean       Median     68.27 % CI              95 % CI
1     0.9985568  0.0001242        0.0001083  0.0001083  (0.0001066, 0.0001100)  (0.0001048, 0.0001116)
2     0.9982568  0.0001231        0.0001231  0.0001231  (0.0001209, 0.0001251)  (0.0001193, 0.0001274)
3     0.9982322  0.0001336        0.0001336  0.0001336  (0.0001316, 0.0001356)  (0.0001297, 0.0001377)
4     0.9973597  0.0001463        0.0001465  0.0001466  (0.0001442, 0.0001489)  (0.0001422, 0.0001507)
5     0.9967486  0.0001695        0.0001695  0.0001695  (0.0001668, 0.0001723)  (0.0001641, 0.0001752)
6     0.9943234  0.0002472        0.0002464  0.0002463  (0.0002427, 0.0002505)  (0.0002373, 0.0002541)
7     0.9939199  0.0002670        0.0002435  0.0002436  (0.0002395, 0.0002473)  (0.0002362, 0.0002517)
8     0.9929374  0.0002579        0.0002530  0.0002528  (0.0002486, 0.0002572)  (0.0002457, 0.0002607)
9     0.9923011  0.0002656        0.0002605  0.0002606  (0.0002564, 0.0002645)  (0.0002526, 0.0002682)
10    0.9914864  0.0002742        0.0002728  0.0002726  (0.0002685, 0.0002770)  (0.0002636, 0.0002815)
11    0.9846023  0.0003928        0.0003664  0.0003666  (0.0003601, 0.0003725)  (0.0003548, 0.0003784)
12    0.9845747  0.0004343        0.0004341  0.0004342  (0.0004279, 0.0004404)  (0.0004206, 0.0004480)
13    0.9818637  0.0003910        0.0003912  0.0003914  (0.0003847, 0.0003974)  (0.0003781, 0.0004024)
14    0.9729011  0.0004781        0.0004783  0.0004779  (0.0004711, 0.0004860)  (0.0004641, 0.0004931)

Table 2 The estimated AURC, the unique analytical $SE_A(\hat{A})$, and the estimated mean, median,
68.27 % CI, and 95 % CI of the probability distribution of estimated $SE_B(\hat{A})$ for 14
matching algorithms. The distribution was generated by 500 runs.


To show both analytical results and bootstrap results, 14 fingerprint-image matching algorithms
were taken as examples. The estimated AURC, the unique analytical SEA (A), and the estimated
mean, median, 68.27 % CI, and 95 % CI of the probability distribution of estimated SEB (A) for
the 14 matching algorithms are shown in Table 2. The distribution was generated by 500 runs. Some
matching algorithms are of relatively high accuracy and some are of relatively low accuracy, as
indicated by their estimated AURC. The larger the estimated AURC is, the more accurate the
matching algorithm is [5, and references therein]. In Table 2, Algorithms 3 and 14 are
Algorithms A and B employed in Section 3.2, respectively.

In order to show the differences, seven decimal places were kept; in the actual computation,
many more decimal places were kept in the intermediate steps. Most analytical estimates
SEA (A) fall within the estimated 95 % CI of the distribution of SEB (A); the exceptions are
Algorithms 1, 7, and 11. This is related to the characteristics of the distributions of
genuine scores and impostor scores.

For these three algorithms, there are huge stand-alone peaks at the lowest impostor score, which
occupy 98.54 %, 97.15 %, and 80.02 % of the impostor population. For the other matching
algorithms, if there is a stand-alone peak, it does not occupy more than 50 % of the population.
Such an extremely large peak of the impostor distribution at the lowest score can cause a very
large portion of the ROC curve at the top to be formed by a long straight line segment (i.e., the
ROC curve jumps from one point to the next over a large distance). This might impede the
bootstrap from functioning well.
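
The effect of such a peak on the ROC curve can be illustrated with a small sketch. The scores below are hypothetical (not taken from any of the 14 algorithms), and the function name roc_points is introduced only for this illustration. With 98 % of the impostor scores tied at the lowest score, the false accept rate jumps from 0.02 to 1 in a single step.

```python
def roc_points(genuine, impostor):
    """Return (FAR, TAR) pairs, one per distinct score used as a threshold."""
    thresholds = sorted(set(genuine) | set(impostor), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        far = sum(s >= t for s in impostor) / len(impostor)  # false accept rate
        tar = sum(s >= t for s in genuine) / len(genuine)    # true accept rate
        points.append((far, tar))
    return points

# Hypothetical scores: 98 % of the impostor scores sit at the lowest score 0.
impostor = [0] * 98 + [1, 2]
genuine = [0, 1, 2, 3, 3, 4, 4, 5, 5, 5]

points = roc_points(genuine, impostor)
# Horizontal distance between consecutive ROC points.
gaps = [b[0] - a[0] for a, b in zip(points, points[1:])]
largest_gap = max(gaps)
print(largest_gap)
```

The last segment of this toy ROC curve spans almost the whole FAR axis, which is the kind of long straight segment described above.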

As indicated in Section 3.1, the estimated CIs in Table 2 were all obtained using Definition 2
of quantile in Ref. [17]. Alternatively, they can be computed by assuming that the
probability distribution of SEB (A) for each matching algorithm is normal. The estimated SEs of
the distribution of SEB (A) for Algorithms 1 through 14 are 0.00000171, 0.00000208,
0.00000206, 0.00000227, 0.00000278, 0.00000411, 0.00000388, 0.00000400, 0.00000406,
0.00000452, 0.00000602, 0.00000656, 0.00000622, and 0.00000731, respectively.
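
A minimal sketch of how one row of summary statistics could be produced from a set of bootstrap runs is given below. It assumes that Definition 2 of quantile means the inverse empirical distribution function with averaging at its jumps; Ref. [17] is not reproduced here, so this reading is an assumption. The 500 values are synthetic normal draws standing in for the bootstrap runs, not the data of this article, and the names quantile_def2 and summarize are hypothetical.

```python
import math
import random
import statistics

def quantile_def2(sorted_x, p):
    """Averaged inverse empirical CDF (assumed reading of 'Definition 2')."""
    n = len(sorted_x)
    h = n * p
    j = math.floor(h)
    if h > j:
        # Between jumps of the empirical CDF: take the next order statistic.
        return sorted_x[min(j, n - 1)]
    # Exactly at a jump: average the two neighbouring order statistics.
    return 0.5 * (sorted_x[max(j - 1, 0)] + sorted_x[min(j, n - 1)])

def summarize(se_values):
    """Mean, median, 68.27 % CI, and 95 % CI of a sample of SE estimates."""
    x = sorted(se_values)
    ci68 = (quantile_def2(x, 0.15865), quantile_def2(x, 0.84135))
    ci95 = (quantile_def2(x, 0.025), quantile_def2(x, 0.975))
    return statistics.fmean(x), statistics.median(x), ci68, ci95

# Synthetic stand-in for 500 bootstrap runs (NOT the article's data).
rng = random.Random(0)
runs = [rng.gauss(1.231e-4, 2.08e-6) for _ in range(500)]
mean, median, ci68, ci95 = summarize(runs)
print(mean, median, ci68, ci95)
```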

The estimated 95 % CIs calculated in these two ways match at least up to the fifth decimal
place. Generally speaking, the more accurate the matching algorithm is, the more decimal
places match. For example, for the high-accuracy Algorithm 2, the estimated 95 % CI using
the quantile method is (0.0001193, 0.0001274) as shown in Table 2, and the 95 % CI assuming
a normal distribution is (0.0001190, 0.0001272) using the estimated mean 0.0001231 and the
estimated SE 0.00000208. This indicates that the distributions of the estimated SEB (A) of
AURC can be assumed to be normal.
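
The normal-approximation interval quoted for Algorithm 2 can be checked with a few lines. This is only a sketch of the arithmetic: the function name normal_ci is hypothetical, and the interval is taken to be the mean plus or minus the standard normal quantile times the estimated SE.

```python
from statistics import NormalDist

def normal_ci(mean, se, level=0.95):
    """Two-sided CI under a normal assumption: mean -/+ z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95 %
    return mean - z * se, mean + z * se

# Estimated mean and SE of the distribution of SEB (A) for Algorithm 2.
lo, hi = normal_ci(0.0001231, 0.00000208)
print(f"({lo:.7f}, {hi:.7f})")  # prints (0.0001190, 0.0001272)
```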

5.2  The  comparison  of  two  types  of  results  using  relative  error 

Table 3 shows the relative errors (%) of SEB (A) with respect to the analytically estimated
SEA (A), computed using the estimated mean, median, 68.27 % CI, and 95 % CI of the
distribution of estimated SEB (A) of AURC, respectively, for the 14 matching algorithms. The
corresponding box diagrams of the relative errors of the 14 matching algorithms are depicted in
Figure 3. There are clearly three outliers, which correspond to Algorithms 1, 7, and 11.
This is consistent with the discussion in Section 5.1.

For random runs of the nonparametric two-sample bootstrap, the SE results most likely to be
obtained are those at the estimated mean and median and within the 68.27 % CI of the
distribution of estimated SEB (A). As discussed in Section 4, a bootstrap estimate of SE can
fall between the 68.27 % CI and the 95 % CI with a probability of only about 27 %. In other
words, the relative errors based on the mean, the median, and the 68.27 % CI, defined in
Section 4, may be more probable than the relative error based on the 95 % CI.


Alg.  Relative errors (%) of SEB (A)
      Using mean  Using median  Using 68.27 % CI  Using 95 % CI
1     12.83       12.79         14.17             15.60
2     0.05        0.02          1.74              3.55
3     0.02        0.03          1.48              3.06
4     0.14        0.21          1.75              3.00
5     0.03        0.03          1.66              3.37
6     0.34        0.38          1.83              3.99
7     8.80        8.76          10.29             11.51
8     1.91        1.97          3.60              4.74
9     1.93        1.90          3.47              4.91
10    0.53        0.61          2.08              3.88
11    6.71        6.66          8.32              9.65
12    0.05        0.04          1.48              3.16
13    0.04        0.09          1.62              3.30
14    0.04        0.05          1.64              3.14

Table 3 Relative errors (%) of SEB (A) using the estimated mean, median, 68.27 % CI, and 95 % CI
of the distribution of SEB (A), respectively, for 14 matching algorithms.

Figure 3 Box diagrams of the 14 relative errors of SEB (A) using the estimated mean, median,
68.27 % CI, and 95 % CI of the distribution of estimated SEB (A), respectively. There are three
outliers.
Including three outliers  Relative errors (%) of SEB (A)
                          Using mean  Using median  Using 68.27 % CI  Using 95 % CI
Mean                      2.39        2.39          3.94              5.49
Median                    0.24        0.30          1.79              3.71

Table 4 The estimated mean and median of 14 relative errors (%) of SEB (A) using the estimated mean,
median, 68.27 % CI, and 95 % CI of the distribution of SEB (A), respectively, if three outliers are included.


Excluding three outliers  Relative errors (%) of SEB (A)
                          Using mean  Using median  Using 68.27 % CI  Using 95 % CI
Mean                      0.46        0.48          2.03              3.65
Median                    0.05        0.09          1.74              3.37

Table 5 The estimated mean and median of 11 relative errors (%) of SEB (A) using the estimated mean,
median, 68.27 % CI, and 95 % CI of the distribution of SEB (A), respectively, if three outliers are excluded.

Moreover, Figure 3 shows that all four distributions are skewed, so the median of each
distribution is more representative than the mean. Hence, both the estimated mean and median of
the 14 relative errors (%) of SEB (A) using the estimated mean, median, 68.27 % CI, and 95 % CI
of the distribution of estimated SEB (A), respectively, are shown in Table 4, where the three
outlier algorithms are included. Those excluding the three outliers are presented in Table 5.

If the three outliers are included, the largest mean relative error of SEB (A) is 5.49 %, which
is associated with a bound of the 95 % CI of the distribution, whereas the median of the 14
relative errors using the median of the distribution of estimated SEB (A) for each matching
algorithm is 0.30 %. If the three outliers are excluded, they are 3.65 % and 0.09 %,
respectively. As a result, the discrepancies between the estimated SEB (A) computed using the
nonparametric two-sample bootstrap and the analytically estimated SEA (A) using the
Mann-Whitney statistic are quite small, especially for those bootstrap runs that are more likely
to occur. This validates the two-sample bootstrap method on large datasets.
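
The summary values in Tables 4 and 5 can be approximately reproduced from the two-decimal entries of Table 3, as sketched below. Because the inputs are already rounded, the results may differ from the tabulated values in the last digit. The names rel_err, OUTLIERS, and summarize are introduced only for this illustration.

```python
import statistics

# Relative errors (%) for Algorithms 1 through 14, transcribed from Table 3.
rel_err = {
    "mean":   [12.83, 0.05, 0.02, 0.14, 0.03, 0.34, 8.80,
               1.91, 1.93, 0.53, 6.71, 0.05, 0.04, 0.04],
    "median": [12.79, 0.02, 0.03, 0.21, 0.03, 0.38, 8.76,
               1.97, 1.90, 0.61, 6.66, 0.04, 0.09, 0.05],
    "ci68":   [14.17, 1.74, 1.48, 1.75, 1.66, 1.83, 10.29,
               3.60, 3.47, 2.08, 8.32, 1.48, 1.62, 1.64],
    "ci95":   [15.60, 3.55, 3.06, 3.00, 3.37, 3.99, 11.51,
               4.74, 4.91, 3.88, 9.65, 3.16, 3.30, 3.14],
}
OUTLIERS = {1, 7, 11}  # algorithms identified as outliers in Section 5.1

def summarize(values, exclude=()):
    """Mean and median over algorithms, optionally skipping some of them."""
    kept = [v for alg, v in enumerate(values, start=1) if alg not in exclude]
    return statistics.fmean(kept), statistics.median(kept)

for name, values in rel_err.items():
    with_outliers = summarize(values)
    without_outliers = summarize(values, exclude=OUTLIERS)
    print(name, with_outliers, without_outliers)
```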

6 Conclusions  and  discussion 

The estimated SE of AURC was computed analytically using the Mann-Whitney statistic, given
that the trapezoidal rule is employed, as well as numerically using the nonparametric two-sample
bootstrap method. The analytical approach is a deterministic process, and thus its estimated
SEA (A) is unique. The bootstrap method, however, is a stochastic process, and thus its estimates
of SEB (A) constitute a distribution. In order to take the variance of such a process into
consideration, the estimated mean, median, 68.27 % CI, and 95 % CI of the distribution of
estimated SEB (A) of AURC were compared with the analytical SEA (A) for each matching
algorithm. When an estimated CI is compared with the analytical result, the relative error is
defined to be the larger of the two relative errors obtained using the upper bound and the lower
bound of the CI.

Fourteen matching algorithms, including three outliers, were taken as examples. Therefore, in
each case, i.e., using the mean, median, 68.27 % CI, and 95 % CI, respectively, 14 relative
errors were generated. The mean and median of these 14 relative errors were computed as well.
All such means and medians, with and without the three outliers, were presented. It was found
that the discrepancies between the bootstrap-estimated SEB (A) and the analytically estimated
SEA (A) are quite small, especially for those bootstrap runs that are more likely to occur.
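
The relative-error definition for a CI can be sketched as follows, assuming that a relative error is the absolute difference from the analytical value divided by that value, expressed in percent (the formal definition is given in Section 4). The function names are hypothetical. The inputs are the seven-decimal values of Algorithm 1 in Table 2, so the results differ slightly from Table 3, which was computed with more decimal places.

```python
def relative_error(estimate, reference):
    """Relative error (%) of an estimate with respect to a reference value."""
    return abs(estimate - reference) / reference * 100.0

def relative_error_ci(ci, reference):
    """Relative error (%) of a CI: the larger of the errors at its two bounds."""
    lower, upper = ci
    return max(relative_error(lower, reference),
               relative_error(upper, reference))

# Algorithm 1, values as printed in Table 2.
se_a = 0.0001242                                   # analytical SEA (A)
err_mean = relative_error(0.0001083, se_a)         # using the estimated mean
err_ci95 = relative_error_ci((0.0001048, 0.0001116), se_a)
print(round(err_mean, 2), round(err_ci95, 2))
```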

As a consequence, this validates the two-sample bootstrap method on large datasets. The
nonparametric two-sample bootstrap was carried out under the i.i.d. assumption for the
datasets in this article. Hence, it shows again that our large government databases used for
developing similarity scores in fingerprint technology have no dependencies. As a matter of fact,
from the statistical point of view, the sample should be collected as randomly as possible in test
design.

One-algorithm hypothesis testing was carried out on each of the 14 matching algorithms to
determine whether the difference between the estimated mean of the distribution of estimated
SEB (A) of AURC and the analytical SEA (A), taken as the hypothesized value, is statistically
significant, since the distribution can be assumed to be normal as discussed in Section 5.1 [1].
It was found that the two-tailed p-values of Algorithms 1, 7, and 11 (the three outliers) were
close to zero, those of Algorithms 8 and 9 were about 20 %, and all others were greater than
70 %. This is consistent with the observations in Table 2, where the analytical SEA (A) falls
outside the estimated 95 % CI for Algorithms 1, 7, and 11, between the 68.27 % CI and the 95 % CI
for Algorithms 8 and 9, and inside the 68.27 % CI for all other algorithms. Hence, generally
speaking, the difference is not statistically significant.
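
As a rough check of these p-values, the sketch below assumes that the test statistic is z = (mean - SEA (A)) / SE, where the mean and the SE are those of the distribution of estimated SEB (A) reported in Section 5.1. This form of the test is an assumption made for illustration, and the function name two_tailed_p is hypothetical; with the rounded published values it gives p-values of the same magnitude as those quoted above.

```python
import math

def two_tailed_p(mean, se, hypothesized):
    """Two-tailed p-value of a z-test under a normal assumption."""
    z = (mean - hypothesized) / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# (estimated mean, estimated SE, analytical SEA (A)) from Section 5.1.
cases = {
    1: (0.0001083, 0.00000171, 0.0001242),
    6: (0.0002464, 0.00000411, 0.0002472),
    8: (0.0002530, 0.00000400, 0.0002579),
    9: (0.0002605, 0.00000406, 0.0002656),
}
p_values = {alg: two_tailed_p(*vals) for alg, vals in cases.items()}
for alg, p in p_values.items():
    print(alg, p)
```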

An extremely large stand-alone peak in the distribution of similarity scores, which occupies a
very large portion of the population, can impede the bootstrap from functioning well, as shown
in Section 5. This might be because the randomness of resampling similarity scores from such a
distribution could be affected by the huge stand-alone peak. The objective of creating such a
peak at the lowest (and/or highest) similarity score is to separate the distributions of genuine
scores and impostor scores as far as possible so as to increase the matching accuracy [5, 11].
This is one of the techniques employed by some matching algorithms. The worst relative error,
15.60 %, which is associated with a bound of the 95 % CI of the distribution as shown in
Table 3, is relatively large in comparison with the others in the table, but it is acceptable
in real numerical computation.

All the tests performed in this article were on large datasets with tens and hundreds of
thousands of genuine scores and impostor scores. A simple test on small medical datasets from
Ref. [7] was also conducted, in which there were 54 genuine scores and 58 impostor scores for
both Modality 1 and Modality 2. It was based on a single random run of the bootstrap method
rather than on generating a distribution of estimated SEB (A). However, the number of bootstrap
replications was set to 2000, as discussed in Section 1. For Modality 1, the estimated AURC was
0.882822, the analytical SEA (A) was 0.032606, and the bootstrap SEB (A) was 0.031943. Thus,
the relative error was 2.03 %. For Modality 2, they were 0.930236, 0.026434, and 0.025059,
respectively. Hence, the relative error was 5.20 %. Both relative errors are small.
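
A single bootstrap run of this kind can be sketched on synthetic data of the same size (54 genuine and 58 impostor scores). The scores below are random normal draws, not the medical data of Ref. [7]. The analytical SE uses a structural-components variance of the Mann-Whitney statistic as a stand-in, which may not coincide term by term with the formula used in this article. All function names are hypothetical.

```python
import bisect
import math
import random
import statistics

def placement(sorted_other, x):
    """Count of values in sorted_other below x, with ties counted as one half."""
    below = bisect.bisect_left(sorted_other, x)
    ties = bisect.bisect_right(sorted_other, x) - below
    return below + 0.5 * ties

def auc(genuine, impostor):
    """Mann-Whitney estimate of the area under the ROC curve."""
    imp = sorted(impostor)
    total = sum(placement(imp, g) for g in genuine)
    return total / (len(genuine) * len(impostor))

def se_analytical(genuine, impostor):
    """Structural-components SE of AURC (assumed stand-in formula)."""
    gen, imp = sorted(genuine), sorted(impostor)
    v_gen = [placement(imp, g) / len(imp) for g in gen]
    v_imp = [1.0 - placement(gen, s) / len(gen) for s in imp]
    var = (statistics.variance(v_gen) / len(gen)
           + statistics.variance(v_imp) / len(imp))
    return math.sqrt(var)

def se_bootstrap(genuine, impostor, n_rep=2000, seed=0):
    """Nonparametric two-sample bootstrap: resample each set with replacement."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_rep):
        g_star = rng.choices(genuine, k=len(genuine))
        i_star = rng.choices(impostor, k=len(impostor))
        reps.append(auc(g_star, i_star))
    return statistics.stdev(reps)

# Synthetic scores (NOT the data of Ref. [7]).
data_rng = random.Random(1)
genuine = [data_rng.gauss(1.5, 1.0) for _ in range(54)]
impostor = [data_rng.gauss(0.0, 1.0) for _ in range(58)]

a_hat = auc(genuine, impostor)
se_a = se_analytical(genuine, impostor)
se_b = se_bootstrap(genuine, impostor)
rel_err = abs(se_b - se_a) / se_a * 100.0
print(a_hat, se_a, se_b, rel_err)
```

Changing the seed of se_bootstrap gives a different single run, which is the run-to-run variability that Section 5 characterizes through the distribution of estimated SEB (A).
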

Comparing the datasets, it appears that the larger the dataset is, the more accurate the
bootstrap method is. For small datasets, the statistics of interest employed in operational ROC
analysis, such as TAR, EER, detection cost function, etc., can lose statistical meaning anyway
because of the small numbers of genuine scores and impostor scores. Under such circumstances,
the metric AURC can be used and its estimated SE can be computed analytically.

For large datasets, from the operational perspective, metrics such as TAR, EER, detection
cost function, etc., must be employed. As pointed out in Section 1, it is hard to calculate
the uncertainties of such statistics of interest without using the nonparametric two-sample
bootstrap method. Therefore, the validation of this approach on large datasets provides a
foundation for computing uncertainties in operational ROC analysis.